We collect multiple variables for a particular purpose, knowing that social complexity can hardly be explained with univariate or bivariate approaches alone. Since high-dimensional data are difficult to visualize, you should consider summarising all these variables into a simpler form.
Let me start with the simplest approach.
This time, I will use the data about city safety:
# clean memory
rm(list = ls())
# location of the data
link="https://github.com/DACSS-Visual/tabular_multivar/raw/main/data/safeCitiesIndexAll.xlsx"
# 'rio' can be used to import EXCEL files:
library(rio)
safe=import(link)
These variables tell us about the safety levels of several cities in the world, and are grouped into the Digital (D), Health (H), Infrastructure (I), and Personal (P) dimensions. For each of these dimensions, there are measures of actions taken (In) and results obtained (Out). We have 49 variables.
Would making a plot of 49 variables be a good idea? A heatmap is the usual answer.
Making a heatmap in ggplot2 requires the data in a particular shape, known as the long format.
The data frame safe is in wide format, that is, one observation per row (the most usual format):
head(safe)
## city D_In_PrivacyPolicy D_In_AwarenessDigitalThreats
## 1 Abu Dhabi 50 66.66667
## 2 Amsterdam 100 100.00000
## 3 Athens 75 100.00000
## 4 Bangkok 25 66.66667
## 5 Barcelona 100 100.00000
## 6 Beijing 75 66.66667
## D_In_PubPrivPartnerships D_In_TechnologyEmployed D_In_CyberSecurity
## 1 50 100 50
## 2 50 100 50
## 3 0 75 50
## 4 0 0 50
## 5 50 100 50
## 6 0 75 100
## D_Out_IdentityTheft D_Out_CompInfected D_Out_InternetAccess
## 1 100.00000 25 97.64526
## 2 99.74426 75 100.00000
## 3 100.00000 25 66.62939
## 4 99.99770 50 31.66104
## 5 99.95572 25 81.69309
## 6 99.99868 25 77.25884
## H_In_EnvironmentPolicies H_In_AccessHealthcare H_In_Beds_1000
## 1 62.79070 73.33333 8.265146
## 2 97.67442 100.00000 38.019671
## 3 41.86047 73.33333 38.846186
## 4 72.09302 80.00000 16.530292
## 5 81.39535 93.33333 24.795438
## 6 79.06977 86.66667 47.899860
## H_In_Doctors_1000 H_In_AccessFood H_In_QualityHealthServ H_Out_AirQuality
## 1 16.365171 99.7 75.0 44.21882
## 2 38.000482 100.0 100.0 84.44259
## 3 73.010130 100.0 62.5 85.00159
## 4 2.327545 98.2 62.5 76.48108
## 5 43.632417 100.0 100.0 84.74936
## 6 44.609262 96.2 62.5 14.76146
## H_Out_WaterQuality H_Out_LifeExpectY H_Out_InfMortality H_Out_CancerMortality
## 1 91.34471 74.89746 93.15708 83.41277
## 2 94.49802 90.42099 97.35614 17.10638
## 3 87.64892 89.97567 96.73406 45.94894
## 4 79.33302 63.94232 86.00311 62.23830
## 5 78.28508 96.65553 96.88958 42.75745
## 6 85.95152 69.10362 88.02488 36.76596
## H_Out_AttacksBioChemRad I_In_EnforceTransportSafety
## 1 100 100.0
## 2 100 70.0
## 3 100 60.0
## 4 100 52.5
## 5 100 82.5
## 6 100 77.5
## I_In_PedestrianFriendliness I_In_QualityRoad I_In_QualityElectricity
## 1 100 100 75
## 2 100 100 100
## 3 100 50 75
## 4 50 25 100
## 5 100 100 100
## 6 50 75 100
## I_In_DisasterManagement I_Out_DeathsDisaster I_Out_VehicularAccidents
## 1 75 100.00000 95.69134
## 2 100 99.73027 96.56438
## 3 75 98.35766 97.16392
## 4 75 97.16755 90.55944
## 5 100 99.41487 90.18464
## 6 50 96.35063 99.65628
## I_Out_PedestrianDeath I_Out_LiveSlums I_Out_AttacksInfrastructure
## 1 67.95027 100.00000 100.00000
## 2 100.00000 94.19238 100.00000
## 3 84.87354 88.56624 91.48936
## 4 68.25461 54.62795 70.21277
## 5 93.83342 100.00000 100.00000
## 6 42.12348 54.26497 100.00000
## P_In_PoliceEngage P_In_CommunityPatrol P_In_StreetCrimeData P_In_TechForCrime
## 1 100 100 100 0
## 2 100 100 100 100
## 3 50 0 100 100
## 4 50 100 0 100
## 5 100 100 100 100
## 6 50 100 100 100
## P_In_PrivateSecurity P_In_GunRegulation P_In_PoliticalStability
## 1 100 47.61905 55
## 2 100 68.25397 80
## 3 100 31.74603 70
## 4 100 41.26984 40
## 5 100 66.66667 70
## 6 100 100.00000 45
## P_Out_PettyCrime P_Out_ViolentCrime P_Out_OrganisedCrime P_Out_Corruption
## 1 75 100 75 66
## 2 50 75 100 83
## 3 75 100 50 44
## 4 50 75 50 35
## 5 75 75 75 58
## 6 50 75 50 40
## P_Out_DrugUse P_Out_TerroristAttacks P_Out_SeverityTerrorist
## 1 86.75869 99.85724 99.98014
## 2 64.62168 99.92862 100.00000
## 3 92.51022 79.65739 99.28486
## 4 92.43354 89.50749 88.55781
## 5 65.18405 100.00000 100.00000
## 6 100.00000 99.78587 98.64919
## P_Out_GenderSafety P_Out_PerceptionSafety P_Out_ThreaTerrorism
## 1 96.11598 84.49 75
## 2 96.88446 67.40 75
## 3 92.62419 49.28 75
## 4 87.88981 51.44 25
## 5 96.81837 60.35 75
## 6 96.55779 58.35 100
## P_Out_ThreatMilitaryConf P_Out_ThreatCivUnrest
## 1 100 75
## 2 100 75
## 3 75 50
## 4 50 25
## 5 100 75
## 6 75 75
This is how we produce a long format:
library(reshape2)
safeL=melt(safe,             # all the data
           id.vars = 'city') # unique id per row
head(safeL)
## city variable value
## 1 Abu Dhabi D_In_PrivacyPolicy 50
## 2 Amsterdam D_In_PrivacyPolicy 100
## 3 Athens D_In_PrivacyPolicy 75
## 4 Bangkok D_In_PrivacyPolicy 25
## 5 Barcelona D_In_PrivacyPolicy 100
## 6 Beijing D_In_PrivacyPolicy 75
The melt function reshaped the data: each variable name became a row (repeated for every city), with its value in the column to the right.
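If you prefer the tidyverse, tidyr's pivot_longer() produces the same long shape as melt(). This alternative is not used in the rest of the chapter; the tiny data frame below is invented just to show the reshaping:

```r
library(tidyr)

# toy wide data: one row per city, one column per indicator
toy = data.frame(city   = c("A", "B"),
                 D_In_x = c(50, 100),
                 H_In_y = c(75, 25))

# every column except 'city' becomes a (variable, value) pair
toyL = pivot_longer(toy,
                    cols      = -city,      # all columns but the id
                    names_to  = "variable",
                    values_to = "value")
toyL   # 4 rows: 2 cities x 2 indicators
```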
Now, the heatmap using this format:
library(ggplot2)
base = ggplot(data = safeL,
              aes(x = variable,
                  y = city))
heat1= base + geom_tile(aes(fill = value))
heat1
Something interesting emerges. Let's also reorder the rows (cities) and columns (indicators) by their medians:
base= ggplot(data = safeL, aes(x = reorder(variable,
                                           value, median),
                               y = reorder(city,
                                           value, median)))
heat2= base + geom_tile(aes(fill = value))
heat2
Here you can see which cities have higher or lower values on which set of variables. You can also change the color palette:
# reversed gradient: low values dark, high values light
heat2_recolor = heat2 +
scale_fill_gradient(low = 'grey30',
high = "grey90"
)
heat2_recolor
The column names need some work:
heat2_axisGood= heat2_recolor +
theme(axis.text.x = element_text(angle = 60,
hjust = 1,
size = 6))
heat2_axisGood
This is still hard to read. An alternative is to average the variables within each dimension. Let me show you an example:
library(magrittr)
#notice I do not need the 'city':
safe[,c(grep("D_In_", names(safe) ))]%>%head()
## D_In_PrivacyPolicy D_In_AwarenessDigitalThreats D_In_PubPrivPartnerships
## 1 50 66.66667 50
## 2 100 100.00000 50
## 3 75 100.00000 0
## 4 25 66.66667 0
## 5 100 100.00000 50
## 6 75 66.66667 0
## D_In_TechnologyEmployed D_In_CyberSecurity
## 1 100 50
## 2 100 50
## 3 75 50
## 4 0 50
## 5 100 50
## 6 75 100
#averaging only numbers, as 'city' is not in data:
apply(safe[,c(grep("D_In_",names(safe) ))], #data
1, #by row
mean)%>% #function to apply by row to data
round(2) # round the result
## 1 2 3 4 5 6 7 8 9 10 11
## 63.33 80.00 60.00 28.33 80.00 63.33 41.67 80.00 66.67 26.67 46.67
## 12 13 14 15 16 17 18 19 20 21 22
## 46.67 100.00 100.00 63.33 26.67 58.33 80.00 21.67 100.00 60.00 28.33
## 23 24 25 26 27 28 29 30 31 32 33
## 58.33 58.33 35.00 58.33 41.67 58.33 80.00 100.00 80.00 31.67 90.00
## 34 35 36 37 38 39 40 41 42 43 44
## 58.33 83.33 41.67 63.33 100.00 78.33 73.33 48.33 43.33 48.33 80.00
## 45 46 47 48 49 50 51 52 53 54 55
## 100.00 58.33 43.33 66.67 53.33 95.00 75.00 90.00 88.33 16.67 88.33
## 56 57 58 59 60
## 90.00 90.00 68.33 16.67 75.00
safe$Digi_I=apply(safe[,c(grep("D_In_",names(safe) ))],1,mean)%>%round(2)
Now let’s do the rest:
safe$Digi_O=apply(safe[,c(grep("D_Out_",names(safe) ))],1,mean)%>%round(2)
safe$Health_I=apply(safe[,c(grep("H_In_",names(safe) ))],1,mean)%>%round(2)
safe$Health_O=apply(safe[,c(grep("H_Out_", names(safe) ))],1,mean)%>%round(2)
safe$Infr_I=apply(safe[,c(grep("I_In_", names(safe) ))],1,mean)%>%round(2)
safe$Infr_O=apply(safe[,c(grep("I_Out_", names(safe) ))],1,mean)%>%round(2)
safe$Perso_I=apply(safe[,c(grep("P_In_", names(safe) ))],1,mean)%>%round(2)
safe$Perso_O=apply(safe[,c(grep("P_Out_", names(safe) ))],1,mean)%>%round(2)
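Those eight calls repeat one pattern: average every column whose name shares a prefix. As a sketch, the same result can be produced in a loop over (new name, prefix) pairs; the small data frame below is invented for the illustration:

```r
# toy wide data with two prefixes to average over
toy = data.frame(city   = c("A", "B"),
                 D_In_a = c(10, 20), D_In_b = c(30, 40),
                 H_In_a = c(50, 60))

# map each new column name to the prefix whose columns it averages
dims = c(Digi_I = "D_In_", Health_I = "H_In_")

for (newCol in names(dims)) {
    cols = grep(dims[[newCol]], names(toy))
    # drop=FALSE keeps a data frame even when only one column matches
    toy[[newCol]] = round(rowMeans(toy[, cols, drop = FALSE]), 2)
}
toy[, c("city", "Digi_I", "Health_I")]
```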
Let’s subset the previous data frame so that a new one has all the Input variables:
safeINS=safe[,c(1,grep("_I$", colnames(safe)))] # '$' for 'end with'.
head(safeINS)
## city Digi_I Health_I Infr_I Perso_I
## 1 Abu Dhabi 63.33 55.91 90.0 71.80
## 2 Amsterdam 80.00 78.95 94.0 92.61
## 3 Athens 60.00 64.93 72.0 64.54
## 4 Bangkok 28.33 55.28 60.5 61.61
## 5 Barcelona 80.00 73.86 96.5 90.95
## 6 Beijing 63.33 69.49 70.5 85.00
Let’s rename the vars in this data set:
names(safeINS)=c("city",'DIGITAL','HEALTH','INFRASTR','PERSONAL')
Remember we need to reshape:
safeINS_long=melt(safeINS,id.vars = 'city')
Let’s redo our heat map:
base= ggplot(data = safeINS_long,
aes(x = reorder(variable,
value,
median),
y =reorder(city,
value,
median)))
base + geom_tile(aes(fill = value)) +
scale_fill_gradient(low = 'grey90',
high = "grey50") +
theme(axis.text.x = element_text(angle = 90,
hjust = 1,
size = 8),
axis.text.y = element_text(size = 6))
With this limited number of variables, you may try a radar plot:
library(ggforce)
base = ggplot(safeINS_long,
aes(x = variable,
y = value,
group = city)) + #new
geom_shape(fill = 'gray',
col='orange')
radar1 = base + coord_polar()
radar1_re = radar1 + facet_wrap(~reorder(city,-value, median),ncol = 10)
radar1_re = radar1_re + theme(axis.text.x = element_text(size = 8))
radar1_label = radar1_re + theme(legend.position="none",
strip.text = element_text(size = 12, colour = 'blue')) #here!!!
radar1_label
You could add extra customization if you want:
### arguments
newBackGroundGrid=element_rect(fill = "white",
colour = "red",
linewidth = 0.5,
linetype = "dashed")
newBackLineGrid=element_line(linewidth = 1,
linetype = 'solid',
colour = "lightblue")
### more customization
radar1_label+ theme(panel.background = newBackGroundGrid,
panel.grid.major = newBackLineGrid)
The colors above are not the best choice; I just used them so you notice where to make changes. Keep in mind that areas are difficult to compare, so plots like these should be used with care (perhaps show only some of the cities?).
The idea behind the radar plot can be represented by simpler plots:
ggplot(data=safeINS_long,
aes(x=reorder(city, value,median),
y=value)) + theme_classic() +
geom_point(shape=5)+
geom_segment(aes(x=city,
y=0,
yend = value,
xend = city),
color='grey',linewidth=0.2) +
facet_grid(~variable) + coord_flip() +
theme(axis.text.y = element_text(size = 5))
Finally, let me produce a plot using all the input variables, reducing their dimensionality by combining clustering and multidimensional scaling.
You can use multidimensional scaling to represent all the variables in a two-dimensional plot. This creates a map that lets you visualize “neighborhoods of similarity” among the cases (cities). With clustering, you use the variables (columns) to define homogeneous groups among the cases (rows).
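Before applying this to the cities, here is a minimal base-R illustration of what cmdscale() does: given only a distance matrix, classical multidimensional scaling recovers low-dimensional coordinates whose pairwise distances approximate the originals. The four labeled points below are invented for the demo (and since they already live in two dimensions, the recovery is exact up to rotation):

```r
# four points on the corners of a 10x10 square
m = matrix(c(0, 0,   0, 10,   10, 0,   10, 10),
           ncol = 2, byrow = TRUE)
rownames(m) = c("a", "b", "c", "d")

d = dist(m)               # pairwise Euclidean distances
map2d = cmdscale(d, k = 2) # recover 2-D coordinates from distances alone

# the distances in the recovered map match the originals
round(dist(map2d), 2)
```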
Let’s see:
allIN=safe[,c(grep("_In_", names(safe) ))]
names(allIN)
## [1] "D_In_PrivacyPolicy" "D_In_AwarenessDigitalThreats"
## [3] "D_In_PubPrivPartnerships" "D_In_TechnologyEmployed"
## [5] "D_In_CyberSecurity" "H_In_EnvironmentPolicies"
## [7] "H_In_AccessHealthcare" "H_In_Beds_1000"
## [9] "H_In_Doctors_1000" "H_In_AccessFood"
## [11] "H_In_QualityHealthServ" "I_In_EnforceTransportSafety"
## [13] "I_In_PedestrianFriendliness" "I_In_QualityRoad"
## [15] "I_In_QualityElectricity" "I_In_DisasterManagement"
## [17] "P_In_PoliceEngage" "P_In_CommunityPatrol"
## [19] "P_In_StreetCrimeData" "P_In_TechForCrime"
## [21] "P_In_PrivateSecurity" "P_In_GunRegulation"
## [23] "P_In_PoliticalStability"
rownames(allIN)=safe$city
dist_in_safe=dist(allIN)
With so much data, it is useful to divide the cities into clusters. Let's decide how many based on the dendrogram.
plot(hclust(dist_in_safe, method = 'ward.D2'),cex=0.6)
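Besides hcut() from factoextra (used next), base R's cutree() can cut an hclust tree into k groups. A toy sketch, with invented points forming two obvious clusters:

```r
# two tight pairs of points, far apart from each other
m = matrix(c(1, 1,   1.2, 0.9,   8, 8,   8.1, 7.9),
           ncol = 2, byrow = TRUE)
rownames(m) = c("a", "b", "c", "d")

hc = hclust(dist(m), method = "ward.D2")
groups = cutree(hc, k = 2)  # one cluster label per row
groups
```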
We could ask for 5 clusters:
library(factoextra)
res.agnes<- hcut(dist_in_safe, k = 5,hc_func='agnes',hc_method = "ward.D")
allIN$agnes=as.factor(res.agnes$cluster)
theMap=cmdscale(dist_in_safe,k = 2)
head(theMap,10)
## [,1] [,2]
## Abu Dhabi -5.2101475 -41.537274
## Amsterdam -88.8291715 0.430404
## Athens 4.7336749 -53.260067
## Bangkok 96.8490858 22.620966
## Barcelona -83.2920752 1.105904
## Beijing -0.2189441 -7.133945
## Bogota 60.5252561 12.102868
## Brussels -75.0830797 -12.969968
## Buenos Aires -32.2039804 -58.859438
## Cairo 83.7611273 -10.429404
Add them to the data frame:
#add to dataframe
allIN$dim1=theMap[,1]
allIN$dim2=theMap[,2]
Let's plot the cities using dim1 and dim2 as coordinates:
library(ggplot2)
base=ggplot(allIN,aes(x=dim1,y=dim2,
color=agnes,
label = rownames(allIN)))
cityPoints=base + geom_point()
cityPoints + geom_text()
Let’s use ggrepel to improve labelling:
library(ggrepel)
cityPointsText=cityPoints + theme_void() +
geom_text_repel(size=1.5,
max.overlaps = 20)
cityPointsText
Would you like your own colors? Be careful!
cityPointsText+scale_color_manual(values = c("orange",'lightgreen','red','magenta','grey50'))
Bring your dashboard with the previous plots; it is due next week!